[Intel Mkl] Parallel BiasAddGrad op with eigen intra thread pool #26426
Conversation
The BiasAddGrad op runs on a single thread, which badly influences the training performance of corresponding models. We provide an optimized parallel implementation of the BiasAddGrad op. Change-Id: I3a8da878eea67a4903a3b68302c6c86c3e536025
EDIT: I prepared this benchmark myself, should be in GitHub soon.
Original:
Could you also add bias_op_test.cc with micro benchmarks; you can use https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/sparse_xent_op_test.cc as an example. I will be able to use them to compare improvements with the clang toolchain.
tensorflow/core/kernels/bias_op.cc
Outdated
#include "tensorflow/core/framework/bounds_check.h"
#include "tensorflow/core/framework/numeric_op.h"
#include "tensorflow/core/framework/op_kernel.h"
#include "tensorflow/core/framework/register_types.h"
#include "tensorflow/core/framework/tensor.h"
#include "tensorflow/core/util/tensor_format.h"
#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"

#include "tensorflow/core/util/work_sharder.h"
Could you please group all tensorflow includes together.
Done.
tensorflow/core/kernels/bias_op.cc
Outdated
  }
}

 private:
  TensorFormat data_format_;

  // Modified for performance tune :: new funcs
I think you can strip all "Modified for performance" comments, they are not really helpful for the reader.
Removed.
tensorflow/core/kernels/bias_op.cc
Outdated
inline void InplaceVecAdd(T* X, const T* Y, const int64 length) {
  //#pragma simd
  for (int64 i = 0; i < length; i++) {
    X[i] = X[i] + Y[i];
Do you know if this pragma works with clang? This loop can be expressed as a simple Eigen expression, something like:
using X = Eigen::Map<Eigen::Array<T, Dynamic, 1>>;
using Y = Eigen::Map<const Eigen::Array<T, Dynamic, 1>>;
X(x, length) += Y(y, length);
I probably messed up some constructor parameters, but a similar idea is used in https://bitbucket.org/eigen/eigen/src/9c300336de9a096c7c4c3b230a6ae9f49fa17aea/unsupported/Eigen/CXX11/src/Tensor/TensorBlock.h#lines-494:506
This will be vectorized/unrolled
OK, replaced them with Eigen APIs. I also figured there is no need to keep the InplaceVecAdd() and VecSumReduce() functions, so I inlined them into the lambda function.
tensorflow/core/kernels/bias_op.cc
Outdated
// Modified for performance tune :: new funcs
// Apply X[0:length-1] = X[0:length-1] + Y[0:length-1];
inline void InplaceVecAdd(T* X, const T* Y, const int64 length) {
function arg names must be 'x' and 'y'
This function has been removed, so that's not a problem now.
tensorflow/core/kernels/bias_op.cc
Outdated
  }
}
// Return sum(X[0:length-1])
inline T VecSumReduce(const T* X, const int64 length) {
a. method arg name
b. it also could be return Eigen::Map<...>(x, length).sum();
I didn't find any other simd pragmas in the TensorFlow codebase, so my assumption is that it's not supported by all compilers.
This function has been removed, so that's not a problem now.
tensorflow/core/kernels/bias_op.cc
Outdated
    .template cast<typename AccumulatorType<T>::type>()
    .reshape(two_dims)
    .sum(reduction_axis)
    .template cast<T>();
I think all performance problems with this grad were fixed in https://bitbucket.org/eigen/eigen/commits/948b43eaf2c5c35c988308d120a8a24781c63de6. What version of TF did you use in your benchmarks?
According to my measurements, my implementation is still faster than the commit you mentioned, so could I keep using my implementation?
I did run benchmarks from https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/bias_op_test.cc and got pretty good results already. Feel free to add more benchmarks; you probably want to test NCHW as well.
1. Group all tensorflow includes together. 2. Delete unnecessary comments. 3. Use Eigen ops instead. Change-Id: I67edcd71cb4feeaf8ab0c1820d2c011e3409344d
Based on your PR I've submitted a very similar change in d5f6595; apparently this "reduce all outer dimensions" pattern is pretty common in other gradient kernels (e.g. in FusedBatchNorm https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/kernels/fused_batch_norm_op.cc#L215). Can you please move the code for the NCHW case to … Currently it takes a Tensor as an input, but it could be any Eigen expression; I'll work on that after this PR is merged.
Hi @ezhulenev,
Yeah, that's the idea. I think you can skip the general case and just add support for (d1, ... , dn) -> dm where m is in the [2, n-1] range. Reducing innermost dimensions is already fast in Eigen, and I added an optimized case for reducing outer dimensions. We just need one more optimized version for reducing all-except-"middle" dimensions.
…On Mon, Mar 11, 2019 at 11:57 PM Letian ***@***.***> wrote:
Hi @ezhulenev <https://github.com/ezhulenev> ,
I have read the codes in ReduceOuterDimensions.
So, you want me to add a functor for situations like:
(32, 11, 256, 256) -> (11);
or maybe more general like:
(D1, D2, ... , DN) -> (DM) (where M belongs to [1, N])
and then ref the new functor to do the reduce?
@ezhulenev OK
@ezhulenev
reduce in tensorflow/core/kernels/redux_functor.h 2. Opt ReduceOuterDimensions for large inner dim. 3. Rewrite NCHW BiasAddGradOp with ReduceOuterDimensions.
This fails internally with an error:
I didn't have time to debug it yet.
@ezhulenev
@ezhulenev is there anything I can do?
I'll try to prepare a reproducible test; right now it's part of a large model and it's hard to tell what's going on.
@ezhulenev Thx. Just tell me if there is anything to do. We need this PR merged ASAP because several models depend on it.
for (int i = num_dims - num_reduce_dims; i < num_dims; ++i)
  inner_dim *= input_dims[i];

if (1 == inner_dim) {
Isn't it a bug? '32x32x1' should be reduced to scalar?
Exactly! It should be "outer_dim" here.
Thank you very much for pointing out the bug!
Fixed.
tensorflow/python/keras:convolutional_test
ValueError: Shapes (3,) and (1, 1, 1, 1, 3) are incompatible
Fixed.
PiperOrigin-RevId: 240170461
The BiasAddGrad op runs on a single thread, which badly influences the training performance of corresponding models.
We provide an optimized parallel implementation of the BiasAddGrad op.
The following are the time costs of the original and optimized BiasAddGrad op for some corresponding models, all run on a single SKX-8180 socket:
YoloV2 (batch size 16): 263.76ms -> 12.06ms
Trans-LT (batch size 1024): 68.61ms -> 3.09ms
Inception-ResV2 (batch size 64): 334.96ms -> 73.18ms
NCF (batch size 1024): 1.34ms -> 0.243ms
NCF-mlperf (batch size 1024): 0.867ms -> 0.341ms
MaskRCNN (batch size 1): 370.57ms -> 15.23ms
DCGAN (batch size 32): 10.46ms -> 0.512ms